Measurements of Spoken Language Variability in a Multilingual Corpus. Predictable Aspects
Abstract
The paper provides cross-linguistic measurements of everyday language use based on the C-ORAL-ROM multilingual corpus of spontaneous speech. The average and the variation coefficient of a series of standard parameters are provided and set against the main sociological and structural contexts of spoken language use. The parameters are: Mid-Length of Utterances (MLU); Mid-Length of the dialogic turn (MLTw); Speed; Mid-Length of the tone unit (MLTone); and Fragmentation. These parameters vary in strongly predictable ways at the cross-linguistic level. MLU correlates positively with MLTw and shows highly predictable values in informal dialogic structures. Both MLU and MLTw correlate inversely with Speed. MLTone and Speed are predictable according to language-specific features, but while MLTone has low intra-linguistic variation, Speed shows a cross-linguistic tendency toward lower values in formal language use. Fragmentation is a permanent feature of spoken language, but it varies mainly according to the speaker.
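The summary statistics named in the abstract (average, variation coefficient, correlation between parameters) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the paper's procedure: length is counted as whitespace-separated words per unit, the variation coefficient is taken as the population standard deviation divided by the mean, correlation is plain Pearson, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def mean_length(units):
    """Average number of whitespace-separated words per unit.

    A unit may be an utterance, a dialogic turn, or a tone unit;
    word counting by whitespace is an assumption of this sketch.
    """
    return mean(len(unit.split()) for unit in units)


def variation_coefficient(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    return pstdev(values) / mean(values)


def pearson(xs, ys):
    """Pearson correlation between two equally long series of measurements."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Under these assumptions, one would compute a value such as `mean_length` for each text, summarize the per-text values with `mean` and `variation_coefficient` for each context of use, and compare two parameters across texts with `pearson`.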
Similar resources
Multilingual Spoken Language Corpus Development for Communication Research
Multilingual spoken language corpora are indispensable for research on areas of spoken language communication, such as speech-to-speech translation. The speech and natural language processing essential to multilingual spoken language research requires unified structure and annotation, such as tagging. In this study, we describe an experience with multilingual spoken language corpus development ...
Vague Language and Interpersonal Communication: An Analysis of Adolescent Intercultural Conversation
This paper is concerned with the analysis of the spoken language of teenagers, taken from a newly developed specialised corpus, the British and Taiwanese Teenage Intercultural Communication Corpus (BATTICC). More specifically, the study employs a discourse analytical approach to examine vague language in an intercultural context among a group of British and Taiwanese adolescents, paying particul...
The Development of the Multilingual LUNA Corpus for Spoken Language System Porting
The development of annotated corpora is a critical process in the development of speech applications for multiple target languages. While the technology to develop a monolingual speech application has reached satisfactory results (in terms of performance and effort), porting an existing application from a source language to a target language is still a very expensive task. In this paper we addr...
Multilingual Aspects of Monolingual Corpora
If one were to ask computational linguists what the most important trend in linguistics in the last decade has been, the majority would most probably answer that it was the massive use of large natural language corpora in many linguistic fields. The concept of collecting large amounts of written or spoken natural language data has become extremely important...
Multilingual corpora for speech-to-speech translation research
Multilingual spoken language corpora are indispensable for developing new speech-to-speech machine translation (S2SMT) technologies. This paper first discusses characteristics that corpora for S2SMT should have, then surveys existing corpora. Finally, it compares these corpora.